I. Project Description

II. Preparing Steps

1. Used Libraries:

library(tidyverse) #The "tidyverse" collects some of the most versatile R packages: ggplot2, dplyr, tidyr, readr, purrr, and tibble. The packages work in harmony to clean, process, model, and visualize data.
library(skimr) #for data summary - so sweet and I like a lot this library
library(mice) #package provides a nice function md.pattern() to get a better understanding of the pattern of missing data
library(VIM) #more helpful visual representation can be obtained using the VIM package for agrr
library(naniar) #https://cran.r-project.org/web/packages/naniar/vignettes/getting-started-w-naniar.html (for gg_mis_var) (Missing values)
library(mlbench) #collection of artificial and real-world machine learning benchmark problems, including, e.g., several data sets from the UCI repository. (also has BostonHousing)
library(caret)
library(mlr)
library(tidyverse)
library(ggthemes)
library(gplots)
library(randomForest)
library(corrplot)
library(kableExtra)
library(plotly)
library(GGally) #for ggpairs
library(Boruta) #for Automated EDA later
library(readxl) #read excel
library(e1071) #Naive Bayes

2. Loading the data:

rawdata0 <- read.csv("CaseStudy2-data.csv")
head(rawdata0)
##   ID Age Attrition    BusinessTravel DailyRate             Department
## 1  1  32        No     Travel_Rarely       117                  Sales
## 2  2  40        No     Travel_Rarely      1308 Research & Development
## 3  3  35        No Travel_Frequently       200 Research & Development
## 4  4  32        No     Travel_Rarely       801                  Sales
## 5  5  24        No Travel_Frequently       567 Research & Development
## 6  6  27        No Travel_Frequently       294 Research & Development
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1               13         4    Life Sciences             1            859
## 2               14         3          Medical             1           1128
## 3               18         2    Life Sciences             1           1412
## 4                1         4        Marketing             1           2016
## 5                2         1 Technical Degree             1           1646
## 6               10         2    Life Sciences             1            733
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         73              3        2
## 2                       3   Male         44              2        5
## 3                       3   Male         60              3        3
## 4                       3 Female         48              3        3
## 5                       1 Female         32              3        1
## 6                       4   Male         32              3        3
##                  JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1        Sales Executive               4      Divorced          4403
## 2      Research Director               3        Single         19626
## 3 Manufacturing Director               4        Single          9362
## 4        Sales Executive               4       Married         10422
## 5     Research Scientist               4        Single          3760
## 6 Manufacturing Director               1      Divorced          8793
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1        9250                  2      Y       No                11
## 2       17544                  1      Y       No                14
## 3       19944                  2      Y       No                11
## 4       24032                  1      Y       No                19
## 5       17218                  1      Y      Yes                13
## 6        4809                  1      Y       No                21
##   PerformanceRating RelationshipSatisfaction StandardHours
## 1                 3                        3            80
## 2                 3                        1            80
## 3                 3                        3            80
## 4                 3                        3            80
## 5                 3                        3            80
## 6                 4                        3            80
##   StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1                1                 8                     3               2
## 2                0                21                     2               4
## 3                0                10                     2               3
## 4                2                14                     3               3
## 5                0                 6                     2               3
## 6                2                 9                     4               2
##   YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1              5                  2                       0
## 2             20                  7                       4
## 3              2                  2                       2
## 4             14                 10                       5
## 5              6                  3                       1
## 6              9                  7                       1
##   YearsWithCurrManager
## 1                    3
## 2                    9
## 3                    2
## 4                    7
## 5                    3
## 6                    7
view(rawdata0) #There are 870 entries, 36 total columns
length(rawdata0) #[1] 36
## [1] 36
skim(rawdata0) #so sweet 0- for data summary
Data summary
Name rawdata0
Number of rows 870
Number of columns 36
_______________________
Column type frequency:
factor 9
numeric 27
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
Over18 0 1 FALSE 1 Y: 870
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 435.50 251.29 1 218.25 435.5 652.75 870 ▇▇▇▇▇
Age 0 1 36.83 8.93 18 30.00 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.50 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.00 3.0 4.00 5 ▂▅▇▆▁
EmployeeCount 0 1 1.00 0.00 1 1.00 1.0 1.00 1 ▁▁▇▁▁
EmployeeNumber 0 1 1029.83 604.79 1 477.25 1039.0 1561.50 2064 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.00 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.00 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.50 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.00 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StandardHours 0 1 80.00 0.00 80 80.00 80.0 80.00 80 ▁▁▇▁▁
StockOptionLevel 0 1 0.78 0.86 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.00 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁
rawdata <- rawdata0

Then the dataset has 870 observations and 36 variables.

3. Checking for missing data:

Actually by skim(rawdata), we can see there is no missing data in the dataset. However, I will introduce some other codes that can be used to check for missing data as a reference. We only need to run one code to check for missing data.

md.pattern(rawdata)
##  /\     /\
## {  `---'  }
## {  O   O  }
## ==>  V <==  No need for mice. This data set is completely observed.
##  \  \|/  /
##   `-----'

##     ID Age Attrition BusinessTravel DailyRate Department DistanceFromHome
## 870  1   1         1              1         1          1                1
##      0   0         0              0         0          0                0
##     Education EducationField EmployeeCount EmployeeNumber
## 870         1              1             1              1
##             0              0             0              0
##     EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 870                       1      1          1              1        1
##                           0      0          0              0        0
##     JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 870       1               1             1             1           1
##           0               0             0             0           0
##     NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 870                  1      1        1                 1                 1
##                      0      0        0                 0                 0
##     RelationshipSatisfaction StandardHours StockOptionLevel
## 870                        1             1                1
##                            0             0                0
##     TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 870                 1                     1               1              1
##                     0                     0               0              0
##     YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager  
## 870                  1                       1                    1 0
##                      0                       0                    0 0
aggr_plot <- aggr(rawdata, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(rawdata), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##                  Variable Count
##                        ID     0
##                       Age     0
##                 Attrition     0
##            BusinessTravel     0
##                 DailyRate     0
##                Department     0
##          DistanceFromHome     0
##                 Education     0
##            EducationField     0
##             EmployeeCount     0
##            EmployeeNumber     0
##   EnvironmentSatisfaction     0
##                    Gender     0
##                HourlyRate     0
##            JobInvolvement     0
##                  JobLevel     0
##                   JobRole     0
##           JobSatisfaction     0
##             MaritalStatus     0
##             MonthlyIncome     0
##               MonthlyRate     0
##        NumCompaniesWorked     0
##                    Over18     0
##                  OverTime     0
##         PercentSalaryHike     0
##         PerformanceRating     0
##  RelationshipSatisfaction     0
##             StandardHours     0
##          StockOptionLevel     0
##         TotalWorkingYears     0
##     TrainingTimesLastYear     0
##           WorkLifeBalance     0
##            YearsAtCompany     0
##        YearsInCurrentRole     0
##   YearsSinceLastPromotion     0
##      YearsWithCurrManager     0
gg_miss_var(rawdata, show_pct = TRUE) + labs(title = "Percent missing of the data") + theme(legend.position = "none", plot.title = element_text(hjust = 0.5), axis.title.y = element_text(angle = 0, vjust = 1))

Then the dataset has no missing data.

4. Dropping unused columns:

We observe by skim() or view() that there are some columns without variation. Then we can drop these columns without affecting our analysis. Observing skim(), we see Over18 has all 870 observations with value Y, EmployeeCount has all 870 observations with value 1, StandardHours has all 870 observations with value 80. In addition, 18 years old is a standard working age and 80 hours/week is high (maybe per 2 weeks - employees receive paycheck per 2 weeks). Then we can drop these three columns.

drop_columns <- which(apply(rawdata, 2, function(x) (length(unique(x)) == 1)))

cols <- names(drop_columns)
rawdata <- rawdata[,-drop_columns]

#Actually, we can drop manually by another code as  rawdata <- select(rawdata, -c("Over18","EmployeeCount", "StandardHours")) . We will get the same results finally.

skim(rawdata)
Data summary
Name rawdata
Number of rows 870
Number of columns 33
_______________________
Column type frequency:
factor 8
numeric 25
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ID 0 1 435.50 251.29 1 218.25 435.5 652.75 870 ▇▇▇▇▇
Age 0 1 36.83 8.93 18 30.00 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.50 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.00 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.00 3.0 4.00 5 ▂▅▇▆▁
EmployeeNumber 0 1 1029.83 604.79 1 477.25 1039.0 1561.50 2064 ▇▇▇▇▇
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.00 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.00 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.00 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.00 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.00 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.50 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.00 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.00 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.00 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.00 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.00 3.0 4.00 4 ▅▅▁▇▇
StockOptionLevel 0 1 0.78 0.86 0 0.00 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.00 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.00 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.00 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.00 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.00 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.00 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.00 3.0 7.00 17 ▇▂▅▁▁

By skim(), we can check again the new dataset and all these three columns have been dropped.

I still want to drop the columns ID and EmployeeNumber. These variables are not related to Salary or Attrition and not usefull for our analysis. They are related to individual identity of each employee. After dropping, I will run skim() to check again the dataset.

rawdata <- select(rawdata, -c("ID","EmployeeNumber"))
skim(rawdata)
Data summary
Name rawdata
Number of rows 870
Number of columns 31
_______________________
Column type frequency:
factor 8
numeric 23
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.5 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.0 3.0 4.00 5 ▂▅▇▆▁
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.0 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.0 66.0 83.00 100 ▇▇▆▇▇
JobInvolvement 0 1 2.72 0.70 1 2.0 3.0 3.00 4 ▁▃▁▇▁
JobLevel 0 1 2.04 1.09 1 1.0 2.0 3.00 5 ▇▇▃▂▁
JobSatisfaction 0 1 2.71 1.11 1 2.0 3.0 4.00 4 ▅▅▁▇▇
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.0 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.0 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.0 14.0 18.00 25 ▇▅▃▂▁
PerformanceRating 0 1 3.15 0.36 3 3.0 3.0 3.00 4 ▇▁▁▁▂
RelationshipSatisfaction 0 1 2.71 1.10 1 2.0 3.0 4.00 4 ▅▅▁▇▇
StockOptionLevel 0 1 0.78 0.86 0 0.0 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.0 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.0 3.0 3.00 6 ▂▇▇▂▃
WorkLifeBalance 0 1 2.78 0.71 1 2.0 3.0 3.00 4 ▁▃▁▇▂
YearsAtCompany 0 1 6.96 6.02 0 3.0 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.0 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.0 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.0 3.0 7.00 17 ▇▂▅▁▁

Then now we have 31 columns in the dataset.

5. Pre-processing the data:

I will convert these numeric variables to factor variables.

factorcolumns <- c("JobInvolvement", "JobSatisfaction", "PerformanceRating", "RelationshipSatisfaction", "WorkLifeBalance")

rawdata[,factorcolumns] <- lapply(rawdata[,factorcolumns], as.factor)
data0 <- rawdata #data0 - dataset that I use for the analysis
skim(data0)
Data summary
Name data0
Number of rows 870
Number of columns 31
_______________________
Column type frequency:
factor 13
numeric 18
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
Attrition 0 1 FALSE 2 No: 730, Yes: 140
BusinessTravel 0 1 FALSE 3 Tra: 618, Tra: 158, Non: 94
Department 0 1 FALSE 3 Res: 562, Sal: 273, Hum: 35
EducationField 0 1 FALSE 6 Lif: 358, Med: 270, Mar: 100, Tec: 75
Gender 0 1 FALSE 2 Mal: 516, Fem: 354
JobInvolvement 0 1 FALSE 4 3: 514, 2: 228, 4: 81, 1: 47
JobRole 0 1 FALSE 9 Sal: 200, Res: 172, Lab: 153, Man: 87
JobSatisfaction 0 1 FALSE 4 4: 271, 3: 254, 1: 179, 2: 166
MaritalStatus 0 1 FALSE 3 Mar: 410, Sin: 269, Div: 191
OverTime 0 1 FALSE 2 No: 618, Yes: 252
PerformanceRating 0 1 FALSE 2 3: 738, 4: 132
RelationshipSatisfaction 0 1 FALSE 4 4: 264, 3: 261, 1: 174, 2: 171
WorkLifeBalance 0 1 FALSE 4 3: 532, 2: 192, 4: 98, 1: 48

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Age 0 1 36.83 8.93 18 30.0 35.0 43.00 60 ▂▇▇▃▂
DailyRate 0 1 815.23 401.12 103 472.5 817.5 1165.75 1499 ▇▇▇▇▇
DistanceFromHome 0 1 9.34 8.14 1 2.0 7.0 14.00 29 ▇▅▂▂▂
Education 0 1 2.90 1.02 1 2.0 3.0 4.00 5 ▂▅▇▆▁
EnvironmentSatisfaction 0 1 2.70 1.10 1 2.0 3.0 4.00 4 ▅▆▁▇▇
HourlyRate 0 1 65.61 20.13 30 48.0 66.0 83.00 100 ▇▇▆▇▇
JobLevel 0 1 2.04 1.09 1 1.0 2.0 3.00 5 ▇▇▃▂▁
MonthlyIncome 0 1 6390.26 4597.70 1081 2839.5 4945.5 8182.00 19999 ▇▅▂▁▁
MonthlyRate 0 1 14325.62 7108.38 2094 8092.0 14074.5 20456.25 26997 ▇▇▇▇▇
NumCompaniesWorked 0 1 2.73 2.52 0 1.0 2.0 4.00 9 ▇▃▂▂▁
PercentSalaryHike 0 1 15.20 3.68 11 12.0 14.0 18.00 25 ▇▅▃▂▁
StockOptionLevel 0 1 0.78 0.86 0 0.0 1.0 1.00 3 ▇▇▁▂▁
TotalWorkingYears 0 1 11.05 7.51 0 6.0 10.0 15.00 40 ▇▇▂▁▁
TrainingTimesLastYear 0 1 2.83 1.27 0 2.0 3.0 3.00 6 ▂▇▇▂▃
YearsAtCompany 0 1 6.96 6.02 0 3.0 5.0 10.00 40 ▇▃▁▁▁
YearsInCurrentRole 0 1 4.20 3.64 0 2.0 3.0 7.00 18 ▇▃▂▁▁
YearsSinceLastPromotion 0 1 2.17 3.19 0 0.0 1.0 3.00 15 ▇▁▁▁▁
YearsWithCurrManager 0 1 4.14 3.57 0 2.0 3.0 7.00 17 ▇▂▅▁▁

Then now we have 13 factor columns and 18 numeric columns in the dataset.

In the next part, I will do Exploratory Data Analysis (or EDA). First, I will analyze the dataset in each variable.

IV. EDA2 - T-test Analysis

1. t-test Analysis on Attrition variable:

t.Age <- t.test(Age~Attrition, data=data2)
t.Age
## 
##  Welch Two Sample t-test
## 
## data:  Age by Attrition
## t = 4.1509, df = 184.91, p-value = 5.05e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1.902905 5.350324
## sample estimates:
##  mean in group No mean in group Yes 
##          37.41233          33.78571
t.MonthlyIncome <-  t.test(MonthlyIncome~Attrition, data=data2)
t.MonthlyIncome
## 
##  Welch Two Sample t-test
## 
## data:  MonthlyIncome by Attrition
## t = 5.3249, df = 228.45, p-value = 2.412e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  1220.382 2654.047
## sample estimates:
##  mean in group No mean in group Yes 
##          6702.000          4764.786
t.Education <-  t.test(Education~Attrition, data=data2)
t.Education
## 
##  Welch Two Sample t-test
## 
## data:  Education by Attrition
## t = 1.4738, df = 197.94, p-value = 0.1421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04651063  0.32165740
## sample estimates:
##  mean in group No mean in group Yes 
##          2.923288          2.785714
t.DistanceFromHome <-  t.test(DistanceFromHome~Attrition, data=data2)
t.DistanceFromHome
## 
##  Welch Two Sample t-test
## 
## data:  DistanceFromHome by Attrition
## t = -2.4218, df = 186.03, p-value = 0.01641
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.4992554 -0.3574961
## sample estimates:
##  mean in group No mean in group Yes 
##          9.028767         10.957143
t.EnvironmentSatisfaction <-  t.test(EnvironmentSatisfaction~Attrition, data=data2)
t.EnvironmentSatisfaction
## 
##  Welch Two Sample t-test
## 
## data:  EnvironmentSatisfaction by Attrition
## t = 2.1362, df = 185.27, p-value = 0.03397
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01768369 0.44474293
## sample estimates:
##  mean in group No mean in group Yes 
##          2.738356          2.507143
t.JobLevel <-  t.test(JobLevel~Attrition, data=data2)
t.JobLevel
## 
##  Welch Two Sample t-test
## 
## data:  JobLevel by Attrition
## t = 5.231, df = 211.76, p-value = 4.042e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.2995698 0.6618784
## sample estimates:
##  mean in group No mean in group Yes 
##          2.116438          1.635714
t.NumCompaniesWorked  <-  t.test(NumCompaniesWorked ~Attrition, data=data2)
t.NumCompaniesWorked 
## 
##  Welch Two Sample t-test
## 
## data:  NumCompaniesWorked by Attrition
## t = -1.6637, df = 183.57, p-value = 0.09788
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.91435932  0.07776441
## sample estimates:
##  mean in group No mean in group Yes 
##          2.660274          3.078571
t.PercentSalaryHike  <-  t.test(PercentSalaryHike ~Attrition, data=data2)
t.PercentSalaryHike
## 
##  Welch Two Sample t-test
## 
## data:  PercentSalaryHike by Attrition
## t = -0.42788, df = 187.22, p-value = 0.6692
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.8596808  0.5532229
## sample estimates:
##  mean in group No mean in group Yes 
##          15.17534          15.32857
t.StockOptionLevel  <-  t.test(StockOptionLevel ~Attrition, data=data2)
t.StockOptionLevel
## 
##  Welch Two Sample t-test
## 
## data:  StockOptionLevel by Attrition
## t = 4.216, df = 187.92, p-value = 3.86e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1845700 0.5091678
## sample estimates:
##  mean in group No mean in group Yes 
##         0.8397260         0.4928571
t.TotalWorkingYears  <-  t.test(TotalWorkingYears ~Attrition, data=data2)
t.TotalWorkingYears
## 
##  Welch Two Sample t-test
## 
## data:  TotalWorkingYears by Attrition
## t = 5.1364, df = 201.19, p-value = 6.596e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.105259 4.728792
## sample estimates:
##  mean in group No mean in group Yes 
##         11.602740          8.185714
t.TrainingTimesLastYear  <-  t.test(TrainingTimesLastYear ~Attrition, data=data2)
t.TrainingTimesLastYear
## 
##  Welch Two Sample t-test
## 
## data:  TrainingTimesLastYear by Attrition
## t = 1.8954, df = 200.36, p-value = 0.05948
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.00876413  0.44301071
## sample estimates:
##  mean in group No mean in group Yes 
##          2.867123          2.650000
t.YearsAtCompany  <-  t.test(YearsAtCompany ~Attrition, data=data2)
t.YearsAtCompany
## 
##  Welch Two Sample t-test
## 
## data:  YearsAtCompany by Attrition
## t = 3.7256, df = 191.55, p-value = 0.0002563
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9922099 3.2248155
## sample estimates:
##  mean in group No mean in group Yes 
##          7.301370          5.192857
t.YearsInCurrentRole  <-  t.test(YearsInCurrentRole ~Attrition, data=data2)
t.YearsInCurrentRole
## 
##  Welch Two Sample t-test
## 
## data:  YearsInCurrentRole by Attrition
## t = 4.9513, df = 208, p-value = 1.522e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.9306052 2.1619584
## sample estimates:
##  mean in group No mean in group Yes 
##          4.453425          2.907143
t.YearsSinceLastPromotion  <-  t.test(YearsSinceLastPromotion ~Attrition, data=data2)
t.YearsSinceLastPromotion
## 
##  Welch Two Sample t-test
## 
## data:  YearsSinceLastPromotion by Attrition
## t = 0.12796, df = 187.59, p-value = 0.8983
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5712911  0.6505475
## sample estimates:
##  mean in group No mean in group Yes 
##          2.175342          2.135714
t.YearsWithCurrManager <-  t.test(YearsWithCurrManager ~Attrition, data=data2)
t.YearsWithCurrManager
## 
##  Welch Two Sample t-test
## 
## data:  YearsWithCurrManager by Attrition
## t = 4.6826, df = 209.75, p-value = 5.084e-06
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.8262439 2.0277679
## sample estimates:
##  mean in group No mean in group Yes 
##          4.369863          2.942857
testnames <- c("Age","Monthly Income", "Education",
               "Distance From Home", "Environment Satisfaction", 
               "Job Level", "Number of Companies Worked",
               "Percent Salary Hike", "Stock Option Level",
               "Total Working Years", "Training Times Last Year",
               "Years At Company", "Years In Current Role",
               "Years Since Last Promotion", "Years With Current Manager"
               )
testpval <- c(t.Age$p.value,
              t.MonthlyIncome$p.value,
              t.Education$p.value,
              t.DistanceFromHome$p.value,
              t.EnvironmentSatisfaction$p.value,
              t.JobLevel$p.value,
              t.NumCompaniesWorked$p.value,
              t.PercentSalaryHike$p.value,
              t.StockOptionLevel$p.value,
              t.TotalWorkingYears$p.value,
              t.TrainingTimesLastYear$p.value,
              t.YearsAtCompany$p.value,
              t.YearsInCurrentRole$p.value,
              t.YearsSinceLastPromotion$p.value,
              t.YearsWithCurrManager$p.value
              )
ttestout <- cbind.data.frame(testnames,testpval)
ttestout$testpval <- round(ttestout$testpval,10) 
names(ttestout) <- c("Variable","P-Value")

The p-values of t-test analyses related to Attrition variable:

ttestout %>% kable()
Variable P-Value
Age 0.0000505
Monthly Income 0.0000002
Education 0.1421319
Distance From Home 0.0164052
Environment Satisfaction 0.0339727
Job Level 0.0000004
Number of Companies Worked 0.0978823
Percent Salary Hike 0.6692297
Stock Option Level 0.0000386
Total Working Years 0.0000007
Training Times Last Year 0.0594832
Years At Company 0.0002563
Years In Current Role 0.0000015
Years Since Last Promotion 0.8983165
Years With Current Manager 0.0000051
  • There are two groups for Attrition variable (Yes and No).
  • At \(\alpha\)-level of significance = 0.05, if p-value<0.05 then we reject the null hypothesis (There are some evidence to suggest that the mean difference in the selected variable for two groups Yes-No of Attrition variable) and if p-value>0.05 then we fail to reject the null hypothesis (There are not enough evidence to suggest that the mean difference in the selected variable for two groups Yes-No of Attrition variable).
  • By t-test analysis, we will have some ideas to build our models later.

V. EDA3 - Numeric variables correlation

In this part, we will have more information related to Monthly Income and Attrition. We will combine our analysis on each variable in the last part.

The following table will show us the correlations between numeric variables.

correlator  <-  function(df){
 df %>%
    keep(is.numeric) %>%
    tidyr::drop_na() %>%
    cor %>%
    corrplot("upper", addCoef.col = "white", number.digits = 2,
             number.cex = 0.5, method="square",
             order="hclust", 
             tl.srt=50, tl.cex = 0.5)
}
correlator(data2)

#Actually, we can remove tidyr::drop_na() %>%  because there is no missing data in this dataset

#We can write the same code directly without using function as follows
#data2 %>% keep(is.numeric) %>% na.omit %>% cor %>% corrplot("upper", addCoef.col = "white", number.digits = 2, number.cex = 0.5, method="square", order="hclust", tl.srt=50, tl.cex = 0.5)

#We can remove na.omit %>% because there is no missing data in this dataset

1. Analysis on Monthly Income

By this correlation table, we see

Relationship MonthlyIncome
JobLevel 0.95
TotalWorkingYears 0.78
YearsAtCompany 0.49
Age 0.48
YearsInCurrentRole 0.36
YearsWithCurrManager 0.33
YearsSinceLastPromotion 0.32

Then by some analyses, we can see that Monthly Income have a strong correlation with Job Level. Month Income, Total Working Years and Job Level are correlated strongly. These correlations are logical.

2. Analysis on Job Level

We have

Relationship JobLevel
MonthlyIncome 0.95
TotalWorkingYears 0.78
YearsAtCompany 0.52
Age 0.48
YearsinCurrentRole 0.39
YearsWithCurrManager 0.37
YearsSinceLastPromotion 0.33

Then by some analyses, we can see that Job Level have a strong correlation with Monthly Income. Month Income, Total Working Years and Job Level are correlated strongly. These correlations are logical.

3. Analysis on Atrrition

By EDA1 - Analysis on each variable and by the correlation plot, we can see that some numerical variables have relationship with Attrition here (MonthlyIncome ~ TotalWorkingYears, JobLevel ~ TotalWorkingYears, YearsAtCompany ~ YearsinCurrentRole, YearsWithCurrManager ~ YearsinCurrentRole, YearsWithCurrManager ~ YearsAtCompany).

We will do another analysis in the next part to decide a good model.

VI. EDA4 - Automated EDA

1. Analysis on Monthly Income

Now I will use Automated EDA for Feature Selection. I use the library Boruta here to select important variables related to MonthlyIncome variable. (http://r-statistics.co/Variable-Selection-and-Importance-With-R.html)

boruta_output <- Boruta(MonthlyIncome ~ ., data=data2, doTrace=2)
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables
print(boruta_signif)  # significant variables
##  [1] "Age"                     "Attrition"              
##  [3] "BusinessTravel"          "Department"             
##  [5] "Education"               "JobLevel"               
##  [7] "JobRole"                 "NumCompaniesWorked"     
##  [9] "TotalWorkingYears"       "YearsAtCompany"         
## [11] "YearsInCurrentRole"      "YearsSinceLastPromotion"
## [13] "YearsWithCurrManager"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

Then after running this code, I have 14 important variables as follows: Age, Attrition, BusinessTravel, Department, Education, JobLevel, JobRole, NumCompaniesWorked, TotalWorkingYears, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager.

2. Analysis on Attrition

Similarly, I will run the following code

boruta_output <- Boruta(Attrition ~ ., data=rawdata0, doTrace=2)
boruta_signif <- names(boruta_output$finalDecision[boruta_output$finalDecision %in% c("Confirmed", "Tentative")])  # collect Confirmed and Tentative variables
print(boruta_signif)  # significant variables
##  [1] "Age"                  "Department"           "JobInvolvement"      
##  [4] "JobLevel"             "JobRole"              "JobSatisfaction"     
##  [7] "MaritalStatus"        "MonthlyIncome"        "NumCompaniesWorked"  
## [10] "OverTime"             "StockOptionLevel"     "TotalWorkingYears"   
## [13] "WorkLifeBalance"      "YearsAtCompany"       "YearsInCurrentRole"  
## [16] "YearsWithCurrManager"
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Variable Importance")

Then I will use important variable to build a model later: Age, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, NumCompaniesWorked, OverTime, StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager.

VII. EDA5 - Multiple Linear Regression for Salary

Here we will use our EDA4 to build model for Monthly Income.

1. Build a Model

I will build a multiple linear regression on Monthly Income related to 14 important variables in the last part.

set.seed(100)
data3 <- select(rawdata, "MonthlyIncome", "Age", "Attrition", "BusinessTravel", "Department", "Education", "JobLevel", "JobRole", "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
splitPerc1 = 0.8
trainIndices1 = sample(1:dim(data3)[1],round(splitPerc1 * dim(data3)[1]))
train1 = data3[trainIndices1,]
test1 = data3[-trainIndices1,]
lm1 = lm(MonthlyIncome ~ ., data = train1)
summary(lm1)
## 
## Call:
## lm(formula = MonthlyIncome ~ ., data = train1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3599.9  -618.2   -28.7   586.9  3976.5 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      -460.4385   584.7780  -0.787 0.431341    
## Age                                -1.6469     6.3150  -0.261 0.794330    
## AttritionYes                      104.0117   117.0640   0.889 0.374588    
## BusinessTravelTravel_Frequently   191.2012   157.9819   1.210 0.226599    
## BusinessTravelTravel_Rarely       409.2843   135.5348   3.020 0.002625 ** 
## DepartmentResearch & Development  228.2193   474.6301   0.481 0.630789    
## DepartmentSales                  -351.1497   481.9957  -0.729 0.466541    
## Education                         -30.3332    41.4089  -0.733 0.464101    
## JobLevel                         2791.0173    93.4593  29.863  < 2e-16 ***
## JobRoleHuman Resources            -33.9349   558.5571  -0.061 0.951573    
## JobRoleLaboratory Technician     -676.1669   191.3353  -3.534 0.000437 ***
## JobRoleManager                   4328.4088   305.8894  14.150  < 2e-16 ***
## JobRoleManufacturing Director     180.1699   192.5615   0.936 0.349790    
## JobRoleResearch Director         4127.5414   247.1638  16.700  < 2e-16 ***
## JobRoleResearch Scientist        -355.1583   192.6310  -1.844 0.065663 .  
## JobRoleSales Executive            570.9446   391.4210   1.459 0.145129    
## JobRoleSales Representative        95.3766   431.1777   0.221 0.825004    
## NumCompaniesWorked                 -3.4078    18.8530  -0.181 0.856614    
## TotalWorkingYears                  49.3545    12.1666   4.057 5.56e-05 ***
## YearsAtCompany                      1.9822    15.2147   0.130 0.896380    
## YearsInCurrentRole                 -0.6587    18.5037  -0.036 0.971613    
## YearsSinceLastPromotion            30.7322    17.2900   1.777 0.075945 .  
## YearsWithCurrManager              -45.0684    18.8133  -2.396 0.016867 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1066 on 673 degrees of freedom
## Multiple R-squared:  0.9492, Adjusted R-squared:  0.9475 
## F-statistic: 571.4 on 22 and 673 DF,  p-value: < 2.2e-16
pred1 = predict(lm1, test1)
RMSE = sqrt(mean((test1$MonthlyIncome - pred1)^2))
RMSE
## [1] 1031.816

Then RMSE is around $1031.816.

2. Validation Requirement for Salary

Here I will use the file CaseStudy2CompSet No Salary.xlsx and create new dataframe case2nosalary1 with the same variables as the dataframe data3 in the last step (Build a Model) (without MonthlyIncome variable).

case2nosalary <- read_excel("CaseStudy2CompSet No Salary.xlsx")
case2nosalary1 <- select(case2nosalary, "Age", "Attrition", "BusinessTravel", "Department", "Education", "JobLevel", "JobRole", "NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany", "YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")

Now I will use the model built in the last step to predicting the salary for the file CaseStudy2CompSet No Salary.xlsx.

lm2 = lm(MonthlyIncome ~ ., data = data3)
pred2 = predict(lm2, case2nosalary1)


MonthlyIncome = pred2
MonthlyIncome = as.data.frame(MonthlyIncome)

case2nosalary2 = cbind(case2nosalary, MonthlyIncome)


write.csv(case2nosalary2, "Case2PredictionsNguyenSalary.csv", row.names = FALSE)

case2nosalary3 <- read.csv("Case2PredictionsNguyenSalary.csv", header = T)

head(case2nosalary3)
##    ID Age Attrition    BusinessTravel DailyRate             Department
## 1 871  43        No Travel_Frequently      1422                  Sales
## 2 872  33        No     Travel_Rarely       461 Research & Development
## 3 873  55       Yes     Travel_Rarely       267                  Sales
## 4 874  36        No        Non-Travel      1351 Research & Development
## 5 875  27        No     Travel_Rarely      1302 Research & Development
## 6 876  39       Yes     Travel_Rarely       895                  Sales
##   DistanceFromHome Education   EducationField EmployeeCount EmployeeNumber
## 1                2         4    Life Sciences             1           1849
## 2               13         1    Life Sciences             1            995
## 3               13         4        Marketing             1           1372
## 4                9         4    Life Sciences             1           1949
## 5               19         3            Other             1           1619
## 6                5         3 Technical Degree             1             42
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       1   Male         92              3        2
## 2                       2 Female         53              3        1
## 3                       1   Male         85              4        4
## 4                       1   Male         66              4        1
## 5                       4   Male         67              2        1
## 6                       4   Male         56              3        2
##                 JobRole JobSatisfaction MaritalStatus MonthlyRate
## 1       Sales Executive               4       Married       19246
## 2    Research Scientist               4        Single       17241
## 3       Sales Executive               3        Single        9277
## 4 Laboratory Technician               2       Married        9238
## 5 Laboratory Technician               1      Divorced       16290
## 6  Sales Representative               4       Married        3335
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  1      Y       No                20                 4
## 2                  3      Y       No                18                 3
## 3                  6      Y      Yes                17                 3
## 4                  1      Y       No                22                 4
## 5                  1      Y       No                11                 3
## 6                  3      Y       No                14                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel
## 1                        3            80                1
## 2                        1            80                0
## 3                        3            80                0
## 4                        2            80                0
## 5                        1            80                2
## 6                        3            80                1
##   TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1                 7                     5               3              7
## 2                 5                     4               3              3
## 3                24                     2               2             19
## 4                 5                     3               3              5
## 5                 7                     3               3              7
## 6                19                     6               4              1
##   YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1                  7                       7                    7
## 2                  2                       0                    2
## 3                  7                       3                    8
## 4                  4                       0                    2
## 5                  7                       0                    7
## 6                  0                       0                    0
##   MonthlyIncome
## 1      5665.948
## 2      2730.876
## 3     12189.751
## 4      1954.560
## 5      2356.053
## 6      6102.791

Later, I will upload the preditecd file Case2PredictionsNguyenSalary.csv for Salary into github.

VIII. EDA6 - Naive Bayes classifiers for Attrition

Here I will use our EDA4 to build model for Attrition. I will use these vatiables: Age, Department, EnvironmentSatisfaction, JobInvolvement, JobLevel, JobRole, JobSatisfaction, MaritalStatus, MonthlyIncome, NumCompaniesWorked, OverTime, StockOptionLevel, TotalWorkingYears, WorkLifeBalance, YearsAtCompany, YearsInCurrentRole, YearsWithCurrManager.

1. Build a Model

I will use Naive Bayes classifiers to build a model here.

data4 <- select(rawdata, "Attrition", "Age", "Department",  "EnvironmentSatisfaction",  "JobInvolvement",  "JobLevel", "JobRole",  "JobSatisfaction", "MaritalStatus",  "MonthlyIncome", "NumCompaniesWorked",  "OverTime", "StockOptionLevel",  "TotalWorkingYears", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager")


set.seed(10000)
splitPercNB1 = 0.8
trainIndicesNB1 = sample(1:dim(data4)[1],round(splitPercNB1 * dim(data4)[1]))
trainNB1 = data4[trainIndicesNB1,]
testNB1 = data4[-trainIndicesNB1,]

NB1 <- naiveBayes(Attrition~.,data = trainNB1,laplace = -1)
predNB1 = predict(NB1, testNB1)

table(as.factor(testNB1$Attrition),predNB1)
##      predNB1
##        No Yes
##   No  125  26
##   Yes   8  15
#Confusion Matrix
confusion.Matrix = confusionMatrix(predNB1,as.factor(testNB1$Attrition))
confusion.Matrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  125   8
##        Yes  26  15
##                                           
##                Accuracy : 0.8046          
##                  95% CI : (0.7378, 0.8607)
##     No Information Rate : 0.8678          
##     P-Value [Acc > NIR] : 0.992836        
##                                           
##                   Kappa : 0.3604          
##                                           
##  Mcnemar's Test P-Value : 0.003551        
##                                           
##             Sensitivity : 0.8278          
##             Specificity : 0.6522          
##          Pos Pred Value : 0.9398          
##          Neg Pred Value : 0.3659          
##              Prevalence : 0.8678          
##          Detection Rate : 0.7184          
##    Detection Prevalence : 0.7644          
##       Balanced Accuracy : 0.7400          
##                                           
##        'Positive' Class : No              
## 
Sensitivity = confusion.Matrix$byClass['Sensitivity']
Specificity = confusion.Matrix$byClass['Specificity']
Accuracy = confusion.Matrix$overall['Accuracy']

Accuracy
##  Accuracy 
## 0.8045977
Sensitivity
## Sensitivity 
##   0.8278146
Specificity
## Specificity 
##   0.6521739

2. Validation Requirement for Attrition

Here I will use the file CaseStudy2CompSet No Attrition.csv and create new dataframe case2noattrition1 with the same variables as the dataframe data4 in the last step (Build a Model) (without Attrition variable).

case2noattrition <- read.csv("CaseStudy2CompSet No Attrition.csv")
case2noattrition1 <- select(case2noattrition, "Age", "Department",  "EnvironmentSatisfaction",  "JobInvolvement",  "JobLevel", "JobRole",  "JobSatisfaction", "MaritalStatus",  "MonthlyIncome", "NumCompaniesWorked",  "OverTime", "StockOptionLevel",  "TotalWorkingYears", "WorkLifeBalance", "YearsAtCompany", "YearsInCurrentRole", "YearsWithCurrManager")

Now I will use the model built in the last step to predicting Attrition for the file CaseStudy2CompSet No Attrition.csv.

set.seed(10000)

NB2 <- naiveBayes(Attrition~.,data = data4,laplace = -1)
predNB2 = predict(NB2, case2noattrition1)
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'JobInvolvement'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'JobSatisfaction'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## Warning in predict.naiveBayes(NB2, case2noattrition1): Type mismatch
## between training and new data for variable 'WorkLifeBalance'. Did you use
## factors with numeric labels for training, and numeric values for new data?
Attrition = predNB2
Attrition = as.data.frame(Attrition)
case2noattrition2 = cbind(case2noattrition,Attrition)

write.csv(case2noattrition2,"Case2PredictionsNguyenAttrition.csv", row.names = FALSE)

case2noattrition3 <- read.csv("Case2PredictionsNguyenAttrition.csv", header = T)

head(case2noattrition3)
##     ID Age    BusinessTravel DailyRate             Department
## 1 1171  35     Travel_Rarely       750 Research & Development
## 2 1172  33     Travel_Rarely       147        Human Resources
## 3 1173  26     Travel_Rarely      1330 Research & Development
## 4 1174  55     Travel_Rarely      1311 Research & Development
## 5 1175  29     Travel_Rarely      1246                  Sales
## 6 1176  51 Travel_Frequently      1456 Research & Development
##   DistanceFromHome Education  EducationField EmployeeCount EmployeeNumber
## 1               28         3   Life Sciences             1           1596
## 2                2         3 Human Resources             1           1207
## 3               21         3         Medical             1           1107
## 4                2         3   Life Sciences             1            505
## 5               19         3   Life Sciences             1           1497
## 6                1         4         Medical             1            145
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2   Male         46              4        2
## 2                       2   Male         99              3        1
## 3                       1   Male         37              3        1
## 4                       3 Female         97              3        4
## 5                       3   Male         77              2        2
## 6                       1 Female         30              2        3
##                     JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1     Laboratory Technician               3       Married          3407
## 2           Human Resources               3       Married          3600
## 3     Laboratory Technician               3      Divorced          2377
## 4                   Manager               4        Single         16659
## 5           Sales Executive               3      Divorced          8620
## 6 Healthcare Representative               1        Single          7484
##   MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1       25348                  1      Y       No                17
## 2        8429                  1      Y       No                13
## 3       19373                  1      Y       No                20
## 4       23258                  2      Y      Yes                13
## 5       23757                  1      Y       No                14
## 6       25796                  3      Y       No                20
##   PerformanceRating RelationshipSatisfaction StandardHours
## 1                 3                        4            80
## 2                 3                        4            80
## 3                 4                        3            80
## 4                 3                        3            80
## 5                 3                        3            80
## 6                 4                        3            80
##   StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1                2                10                     3               2
## 2                1                 5                     2               3
## 3                1                 1                     0               2
## 4                0                30                     2               3
## 5                2                10                     3               3
## 6                0                23                     1               2
##   YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1             10                  9                       6
## 2              5                  4                       1
## 3              1                  1                       0
## 4              5                  4                       1
## 5             10                  7                       0
## 6             13                 12                      12
##   YearsWithCurrManager Attrition
## 1                    8       Yes
## 2                    4       Yes
## 3                    0       Yes
## 4                    2       Yes
## 5                    4       Yes
## 6                    8       Yes

Later, I will upload the preditecd file Case2PredictionsNguyenAttrition.csv for Salary into github.